WHAT IS CLAIMED IS:

1. A method for determining whether documents, in a large 
collection of documents, are near-duplicates, the method 
comprising: 

a) for each of at least some of the documents in the 
large collection of documents, generating at least two 
fingerprints; 

b) preprocessing the fingerprints to identify any 
fingerprints that are associated with only one 
document; and 

c) determining whether or not documents are 
near-duplicate documents based on fingerprints other 
than those identified as being associated with only 
one document.
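
Steps b) and c) of claim 1 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `near_duplicate_pairs` and the input shape (a mapping from document id to its fingerprints) are assumptions, and the "any shared fingerprint" match rule is borrowed from claim 2.

```python
from collections import defaultdict
from itertools import combinations

def near_duplicate_pairs(doc_fingerprints):
    """doc_fingerprints maps a document id to an iterable of fingerprints."""
    # Step b): index documents by fingerprint so that fingerprints
    # associated with only one document can be identified.
    docs_by_fp = defaultdict(set)
    for doc_id, fps in doc_fingerprints.items():
        for fp in fps:
            docs_by_fp[fp].add(doc_id)
    shared = {fp: docs for fp, docs in docs_by_fp.items() if len(docs) > 1}

    # Step c): only the remaining (shared) fingerprints are used to
    # decide which documents are near-duplicates.
    pairs = set()
    for docs in shared.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs
```

Discarding single-document fingerprints up front shrinks the data that the comparison step has to examine, since such fingerprints can never produce a match.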

2. The method of claim 1 wherein the act of determining 
whether or not documents are near-duplicate documents 
includes:

i) for any two documents, determining whether or 
not any fingerprints of a first of the two 
documents matches any fingerprints of a second of 
the two documents, and 

ii) if it is determined that a fingerprint of 
the first of the two documents does match a 
fingerprint of the second of the two documents, 
then concluding that the two documents are 
near-duplicates.

3. The method of claim 1 wherein the act of generating at 
least two fingerprints for each of the documents includes:

i) extracting parts from the document,

ii) hashing each of the extracted parts to
generate a hash value for each of the extracted 
parts, 

iii) populating a predetermined number of lists 
with the extracted parts based on their 
respective hash values, and 

iv) for each of the predetermined number of 
lists, determining a fingerprint based on the 
contents of the list. 
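
The four acts of claim 3 can be sketched as below. The choices here are assumptions made for illustration only: words as the extracted parts, truncated SHA-1 as the hash, `hash % num_lists` as the rule for choosing a list, and the names `stable_hash` and `generate_fingerprints`.

```python
import hashlib

def stable_hash(text: str) -> int:
    """Deterministic, stateless 64-bit hash (truncated SHA-1; an assumed choice)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")

def generate_fingerprints(document: str, num_lists: int = 4):
    lists = [[] for _ in range(num_lists)]
    for part in document.split():              # i) extract parts (here: words)
        value = stable_hash(part)              # ii) hash each extracted part
        lists[value % num_lists].append(part)  # iii) pick a list from the hash value
    # iv) one fingerprint per list; an empty list yields None, which
    # callers should ignore when looking for matching fingerprints.
    return [stable_hash("\x00".join(lst)) if lst else None for lst in lists]
```

In this sketch, replacing a single word alters at most two lists (the one that loses the old word and the one that gains the new word), so the fingerprints of the other lists are unchanged.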



4. The method of claim 3 wherein the act of hashing each
of the extracted parts to generate a hash value for each of 
the extracted parts uses a hash function that is 
repeatable, deterministic and not sensitive to state. 

5. The method of claim 3 wherein the parts extracted from 
the document are selected from a group of parts consisting 
of characters, words, sentences, paragraphs and sections. 

6. The method of claim 3 wherein the parts extracted from 
the document do not overlap. 

7. The method of claim 3 wherein the parts extracted from 
the document overlap. 

8. The method of claim 3 wherein each of the acts of 
determining a fingerprint uses a hashing function with a 
low probability of collision.

9. The method of claim 3 wherein the act of determining a 
fingerprint uses a function that is sensitive to an order 
of the parts within a list.

10. The method of claim 3 wherein the act of determining a
fingerprint uses a function that is insensitive to an order 
of the parts within a list. 
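
The difference between the order-sensitive function of claim 9 and the order-insensitive function of claim 10 can be illustrated as follows. The function names and the use of SHA-1 are assumptions; sorting the parts before hashing is one possible way to make a fingerprint insensitive to order.

```python
import hashlib

def _digest(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def order_sensitive_fingerprint(parts):
    # Hash the parts in the order they appear in the list.
    return _digest("\x00".join(parts))

def order_insensitive_fingerprint(parts):
    # Sort first, so any permutation of the same parts hashes identically.
    return _digest("\x00".join(sorted(parts)))
```

With the order-insensitive variant, two documents whose lists hold the same parts in a different sequence still produce the same fingerprint.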

11. An apparatus for determining whether documents, in a 
large collection of documents, are near-duplicates, the 
apparatus comprising: 

a) a fingerprint generator for generating, for each 
of the documents in the large collection of documents, 
at least two fingerprints; 

b) a preprocessor for identifying any fingerprints 
that are associated with only one document; and 

c) a fingerprint comparison facility for determining 
whether or not documents are near-duplicate documents 
based on fingerprints other than those identified as 
being associated with only one document. 

12. The apparatus of claim 11 wherein the fingerprint 
generator includes:

i) an extractor for extracting parts from the 
document, 

ii) a hashing facility for hashing each of the 
extracted parts to generate a hash value for each 
of the extracted parts, 

iii) a list population facility for populating a
predetermined number of lists with the extracted 
parts based on their respective hash values, and 

iv) means for determining a fingerprint for each 
of the predetermined number of lists, based on 
the contents of the list.

13. A method for clustering documents, the method
comprising: 

a) for each of the documents, generating at least two 
fingerprints; and 

b) for each of the documents,

i) determining whether or not the document is a 
near-duplicate of any of previously processed 
documents, based on fingerprints of the 
documents, 

ii) if it is determined that the document is not 
a near-duplicate of any previously processed 
document, then associating the document with a 
unique cluster identifier, and 

iii) if it is determined that the document is a 
near-duplicate of a previously processed 
document, then associating the document with a 
cluster identifier associated with the previously 
processed document. 
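
The clustering loop of claim 13 can be sketched as below, assuming that a near-duplicate is any previously processed document sharing at least one fingerprint. The name `assign_cluster_ids` and the use of small integers as cluster identifiers are assumptions for illustration.

```python
from itertools import count

def assign_cluster_ids(doc_fingerprints):
    """doc_fingerprints: iterable of (doc_id, fingerprints) in processing order."""
    cluster_of_fp = {}    # fingerprint -> cluster id of the first document that had it
    cluster_of_doc = {}
    fresh_ids = count(1)
    for doc_id, fps in doc_fingerprints:
        # i) look for a previously processed document sharing a fingerprint
        match = next((cluster_of_fp[fp] for fp in fps if fp in cluster_of_fp), None)
        # ii) no match: new unique cluster id; iii) match: reuse its cluster id
        cluster_id = match if match is not None else next(fresh_ids)
        cluster_of_doc[doc_id] = cluster_id
        for fp in fps:
            cluster_of_fp.setdefault(fp, cluster_id)
    return cluster_of_doc
```

Because each document inherits the identifier of the document it matches, clusters in this sketch grow transitively: a document can join a cluster through a fingerprint contributed by a later member.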



14. A method for filtering search results to remove
near-duplicates, the method comprising:

a) for each of a predetermined number of candidate
search results, determining whether the candidate
search result is a near-duplicate of another candidate
search result; and

b) if it is determined that the candidate search
result is a near-duplicate of another candidate search
result, then rejecting the candidate search result.

15. The method of claim 14 wherein the act of determining
whether a candidate search result is a near-duplicate of
another candidate search result includes:

i) comparing a cluster identifier of the
candidate search result with that of the other
candidate search result, and

ii) if the cluster identifiers of the two
candidate search results match, then concluding
that the two candidate search results are
near-duplicates.
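
The filtering of claims 14 and 15 can be sketched as a single pass over the candidate results. The name `filter_near_duplicates` and the `cluster_id_of` mapping are hypothetical; the sketch assumes cluster identifiers were assigned beforehand and keeps the first result seen from each cluster.

```python
def filter_near_duplicates(candidates, cluster_id_of):
    """Keep the first candidate from each cluster; reject the rest."""
    seen_clusters = set()
    kept = []
    for result in candidates:
        cluster_id = cluster_id_of[result]
        if cluster_id in seen_clusters:
            continue  # same cluster id as an earlier result: near-duplicate, reject
        seen_clusters.add(cluster_id)
        kept.append(result)
    return kept
```

Comparing precomputed cluster identifiers means no fingerprints need to be examined at query time.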

16. The method of claim 15 wherein cluster identifiers of
the candidate search results are assigned by:

i) determining whether or not a document
corresponding to the candidate search result is a
near-duplicate of any of previously processed
documents,

ii) if it is determined that the document
corresponding to the candidate search result is
not a near-duplicate of any previously processed
document, then associating the document with a
unique cluster identifier, and

iii) if it is determined that the document
corresponding to the candidate search result is a
near-duplicate of a previously processed
document, then associating the document
corresponding to the candidate search result with
a cluster identifier associated with the
previously processed document.

17. A method for determining whether two documents are
near-duplicates, the method comprising:

a) for each of the two documents, generating at least
two fingerprints by

i) extracting parts from the document,

ii) hashing each of the extracted parts to
generate a hash value for each of the extracted
parts,

iii) populating at least two lists with the
extracted parts based on their respective hash
values, and

iv) for each of the predetermined number of
lists, determining a fingerprint based on the
contents of the list; and

b) determining whether or not the two documents are
near-duplicate documents based on their fingerprints.

18. The method of claim 17 wherein the act of determining
whether or not the two documents are near-duplicate
documents includes:

i) determining whether or not any fingerprints
of a first of the two documents matches any
fingerprints of a second of the two documents,
and

ii) if it is determined that a fingerprint of
the first of the two documents does match a
fingerprint of the second of the two documents,
then concluding that the two documents are
near-duplicates.

19. The method of claim 17 wherein the act of hashing each
of the extracted parts to generate a hash value for each of
the extracted parts uses a hash function that is
repeatable, deterministic and not sensitive to state.

20. The method of claim 17 wherein the parts extracted
from the document are selected from a group of parts
consisting of characters, words, sentences, paragraphs and
sections.

21. The method of claim 17 wherein the parts extracted
from the document do not overlap.



22. The method of claim wherein the parts extracted
from the document overlap.

23. The method of claim 17 wherein the act of determining
a fingerprint uses a hashing function with a low
probability of collision.



24. The method of claim 17 wherein the act of determining
a fingerprint uses a function that is sensitive to an order
of the parts within a list.



25. The method of claim 17 wherein the act of determining
a fingerprint uses a function that is insensitive to an
order of the parts within a list.

26. A method, for use in a crawling facility, for reducing
processing and bandwidth used, the method comprising:

a) for each of the documents, generating at least two
fingerprints by

i) extracting parts from the document,

ii) hashing each of the extracted parts to
generate a hash value for each of the extracted
parts,

iii) populating at least two lists with the
extracted parts based on their respective hash
values, and

iv) for each of the predetermined number of
lists, determining a fingerprint based on the
contents of the list;

b) determining whether or not the two documents are
near-duplicate documents based on their fingerprints;
and

c) if it is determined that the two documents are
near-duplicates, then indicating that one of the two
documents is not to be processed during a subsequent
crawl.

27. A method for treating broken links to document, the
method comprising:

a) determining whether a link to a first document is
broken;

b) if it is determined that a link to a first
document is broken, determining whether there exists a
second document that is a near-duplicate of the first
document; and

c) if it is determined that there exists a second
document that is a near-duplicate of the first
document, then replacing the broken link to the first
document with a link to the second document,

wherein the act of determining whether or not
there exists a second document is a near-duplicate of the
first document is performed by:

i) for each of the documents, generating at
least two fingerprints by

A) extracting parts from the document,

B) hashing each of the extracted parts to
generate a hash value for each of the
extracted parts,

C) populating at least two lists with the
extracted parts based on their respective
hash values, and

D) for each of the predetermined number of
lists, determining a fingerprint based on
the contents of the list; and

ii) determining whether or not the two documents
are near-duplicate documents based on their
fingerprints.
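
The broken-link handling of this claim can be sketched as follows. Everything named here is hypothetical: `is_broken` stands in for whatever link check is used, and `fingerprints` is assumed to hold fingerprints that were stored for each document while it was still reachable.

```python
def replacement_for_broken_link(target, is_broken, fingerprints, candidates):
    """Return a link to use in place of `target`, or None if none is found."""
    # a) a link that is not broken is kept as-is
    if not is_broken(target):
        return target
    # b) look for a second document sharing at least one fingerprint
    target_fps = set(fingerprints[target])
    for other in candidates:
        if other != target and target_fps & set(fingerprints[other]):
            return other  # c) replace the broken link with this near-duplicate
    return None
```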



28. An apparatus for determining whether two documents are
near-duplicates, the apparatus comprising:

a) a fingerprint generator for generating at least
two fingerprints for each of the two documents, the
fingerprint generator including

i) an extractor for extracting parts from the
document,

ii) a hashing facility for hashing each of the
extracted parts to generate a hash value for each
of the extracted parts,

iii) a list population facility for populating
at least two lists with the extracted parts based
on their respective hash values, and

iv) means for determining, for each of the
predetermined number of lists, a fingerprint
based on the contents of the list; and

b) a comparison facility for determining whether or
not the two documents are near-duplicate documents
based on their fingerprints.



29. An improved crawling facility, for reducing processing
and bandwidth used, the crawling facility comprising:

a) a fingerprint generator for generating, for each
of the documents, at least two fingerprints, the 
fingerprint generator including 

i) an extractor for extracting parts from the 
document, 

ii) a hashing facility for hashing each of the 
extracted parts to generate a hash value for each 
of the extracted parts, 

iii) a list population facility for populating 
at least two lists with the extracted parts based 
on their respective hash values, and 

iv) means for determining, for each of the 
predetermined number of lists, a fingerprint 
based on the contents of the list; 

b) a comparison facility for determining whether or 
not the two documents are near-duplicate documents 
based on their fingerprints; and 

c) a document processor, wherein if it is determined 
that the two documents are near-duplicates, then the 
document processor indicates that one of the two 
documents is not to be processed during a subsequent 
crawl.

30. A search filter for processing search results to
remove near-duplicates, the search filter comprising:

a) a near-duplicate determination facility for
determining, for each of a predetermined number of
candidate search results, whether the candidate search
result is a near-duplicate of another candidate search
result; and

b) a filter for rejecting the candidate search result
if it is determined that the candidate search result
document identified by the document identifier stored
in the first field.



34. A machine-readable medium having stored thereon
machine-executable instructions which, when executed by a
machine:

i) extract parts from a document,

ii) hash each of the extracted parts to generate a
hash value for each of the extracted parts,

iii) populate a predetermined number of lists with
the extracted parts based on their respective hash
values, and

iv) for each of the predetermined number of lists,
determine a fingerprint based on the contents of the
list.

35. A method for generating at least two fingerprints for
a document comprising:

a) extracting parts from the document;

b) hashing each of the extracted parts to generate a
hash value for each of the extracted parts;

c) populating a predetermined number of lists with
the extracted parts based on their respective hash
values; and

d) for each of the predetermined number of lists,
determining a fingerprint based on the contents of the
list.

36. The method of claim 35 wherein each of the lists has
an associated hashing function,

wherein each of the extracted parts can be contained
in none of the lists, one of the lists, or more of the
lists based on the hash functions for the lists.



37. The method of claim wherein each hash function
is dynamically adjusted such that the probability that the
hash function will populate its associated list with a part
decreases as the size of the document increases.
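
One way to read the two claims above is sketched below: each list has its own hash function, a part may land in zero, one, or several lists, and the chance of admission shrinks as the document grows. The names `admits` and `populate_lists`, the `target_size` parameter, and the `target_size / num_parts` rule are all assumptions; the claims do not specify how the adjustment is made.

```python
import hashlib

def admits(part: str, list_index: int, num_parts: int, target_size: int = 50) -> bool:
    """Per-list hash test: does the hash function of this list admit `part`?"""
    # Salting with the list index gives every list its own hash function.
    digest = hashlib.sha1(f"{list_index}:{part}".encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # roughly uniform in [0, 2**64)
    # Admission probability falls as the document gets larger.
    probability = min(1.0, target_size / max(num_parts, 1))
    return value < int(probability * 2**64)

def populate_lists(parts, num_lists: int = 4, target_size: int = 50):
    n = len(parts)
    return [[p for p in parts if admits(p, i, n, target_size)]
            for i in range(num_lists)]
```

Under this rule every list holds about `target_size` parts on average once the document exceeds that size, so the lists do not grow without bound for very long documents.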



38. A method comprising:

a) determining whether there exists a second document 
that is a near-duplicate of a first document; and 

b) indexing the first document but not the second 
document, 

wherein the act of determining whether or not 
there exists a second document is a near-duplicate of the 
first document is performed by: 



i) for each of the documents, generating at 
least two fingerprints by 

A) extracting parts from the document, 

B) hashing each of the extracted parts to 
generate a hash value for each of the 
extracted parts, 

C) populating at least two lists with the 
extracted parts based on their respective 
hash values, and 

D) for each of the predetermined number of 
lists, determining a fingerprint based on 
the contents of the list; and 

ii) determining whether or not the two documents 
are near-duplicate documents based on their 
fingerprints.

39. A method for determining whether two documents are
near-duplicates, the method comprising:

a) for each of the two documents, generating at least
two fingerprints; and

b) determining whether or not the two documents are
near-duplicate documents by

i) determining whether or not any one of the
fingerprints of a first of the two documents
matches any one of the fingerprints of a second
of the two documents, and

ii) if it is determined that any one fingerprint
of the first of the two documents does match any
one fingerprint of the second of the two
documents, then concluding that the two documents
are near-duplicates.

40. A method for determining whether two objects are
near-duplicates, the method comprising:

a) for each of the two objects, generating at least
two fingerprints by

i) extracting features from the object,

ii) hashing each of the extracted features to
generate a hash value for each of the extracted
features,

iii) populating at least two lists with the
extracted features based on their respective hash
values, and

iv) for each of the predetermined number of
lists, determining a fingerprint based on the
contents of the list; and

b) determining whether or not the two objects are
near-duplicates based on their fingerprints.

41. The method of claim 40 wherein each of the two objects
is a word, and

wherein the extracted features define context vectors.

42. The method of claim wherein each of the two objects
is a word, and

wherein, in each case, the extracted features are
words that frequently occur in close proximity to the word.

43. The method of claim wherein the two objects are
words, and

wherein if the two objects are determined to be near
duplicates, then determining the two words to be synonyms.

44. A method for determining whether a first document and
a second document in a collection of documents are
near-duplicates, the method comprising:

a) for each of the documents in the collection of
documents, generating at least two fingerprints; and

b) concluding that the first and second documents are
near-duplicates if any one of the at least two
fingerprints of the first document matches any one of
the at least two fingerprints of the second document,

wherein documents in the collection of documents
without any common fingerprints are not checked to
determine whether or not they are near duplicates.

45. The method of claim 44 further comprising:

a2) for each of the documents in the collection of
documents, generating a document-fingerprint pair for
each of the at least two fingerprints; and

a3) sorting the fingerprint-document pairs based on
values of the fingerprints.
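
The pair generation and sorting recited above can be sketched as follows. After sorting by fingerprint value, documents that share a fingerprint sit next to each other, so documents with no common fingerprint are never compared. The function name `candidate_pairs` and the input mapping are assumptions for illustration.

```python
from itertools import combinations, groupby
from operator import itemgetter

def candidate_pairs(doc_fingerprints):
    """doc_fingerprints maps a document id to an iterable of fingerprints."""
    # a2) one (fingerprint, document) pair per fingerprint of each document
    pairs = [(fp, doc) for doc, fps in doc_fingerprints.items() for fp in set(fps)]
    # a3) sort the pairs by fingerprint value (ties broken by document id)
    pairs.sort()
    # Runs of equal fingerprints now identify documents to treat as near-duplicates.
    result = set()
    for _, run in groupby(pairs, key=itemgetter(0)):
        docs = [doc for _, doc in run]
        result.update(combinations(docs, 2))
    return result
```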
is a near-duplicate of another candidate search
result.



31. The search filter of claim 30 wherein the
near-duplicate determination facility includes a comparison
facility for comparing a cluster identifier of the
candidate search result with that of another candidate
search result, and wherein if the cluster identifiers of
the two candidate search results match, then it is
concluded that the two candidate search results are
near-duplicates.

32. A machine-readable medium having stored thereon a
plurality of records, each of the records comprising:

a) a first field for storing a document identifier;
and

b) a plurality of lists, each of the plurality of
lists containing elements of a document identified by
the document identifier stored in the first field,

wherein a hash function is used to determine
which of the plurality of lists each of the elements will
be contained in.

33. A machine-readable medium having stored thereon a
plurality of records, each of the records comprising:

a) a first field for storing a document identifier;
and

b) a plurality of fingerprints, wherein each of the
fingerprints is a low collision probability hash
function of elements contained in a corresponding
list, and wherein the elements are elements of a